Lesson 11: groupby

Author

최규빈

Published

July 25, 2023

강의영상

imports

import numpy as np
import pandas as pd

df.groupby

- 예제1: 아래의 예제에서 그룹별 평균을 구하여라.

_category = ['A']*5+['B']*5
_value = np.concatenate([np.random.randn(5), np.random.randn(5)+10])
df = pd.DataFrame({'category':_category, 'value':_value})
df
category value
0 A -1.115880
1 A 1.109800
2 A 0.364589
3 A 0.518465
4 A 1.571603
5 B 8.883637
6 B 12.943322
7 B 8.463850
8 B 10.164719
9 B 10.685973

- 방법1: groupby를 이용하지 않는 방법들

# 이렇게 해도 되고 
df[df.category == 'A'].value.mean(), df[df.category == 'B'].value.mean()
(0.4897153568753304, 10.228300113047595)
# 이것도 가능
{s:df[df.category == s].value.mean() for s in set(df.category)}
{'A': 0.4897153568753304, 'B': 10.228300113047595}

- 방법2: groupby만 사용

grouped_list = list(df.groupby('category'))
grouped_list
[('A',
    category     value
  0        A -1.115880
  1        A  1.109800
  2        A  0.364589
  3        A  0.518465
  4        A  1.571603),
 ('B',
    category      value
  5        B   8.883637
  6        B  12.943322
  7        B   8.463850
  8        B  10.164719
  9        B  10.685973)]
bundle1, bundle2 = grouped_list
group_name1, sub_df1 = bundle1
display(sub_df1)
category value
0 A -1.115880
1 A 1.109800
2 A 0.364589
3 A 0.518465
4 A 1.571603
for group_name, sub_df in df.groupby('category'):
    print(group_name)
    display(sub_df)
A
B
category value
0 A -1.115880
1 A 1.109800
2 A 0.364589
3 A 0.518465
4 A 1.571603
category value
5 B 8.883637
6 B 12.943322
7 B 8.463850
8 B 10.164719
9 B 10.685973
{group_name:sub_df.value.mean() for group_name, sub_df in df.groupby('category')}
{'A': 0.4897153568753304, 'B': 10.228300113047595}

- 방법3: groupby().aggregate()

grouped = df.groupby('category')
grouped.aggregate(np.mean)
value
category
A 0.489715
B 10.228300
df.groupby(['category']).aggregate(np.mean)
value
category
A 0.489715
B 10.228300

groupby(?)에서 올 수 있는 구조

  • 열의이름
  • [열의이름,열의이름]

aggregate(?)에서 올 수 있는 구조

  • 함수: 함수자체1가 오거나, 함수를 의미하는 문자열2 이 올 수 있음.
  • 리스트: [함수, 함수] # 여기에서 함수자리에는 함수자체, 혹은 함수문자열 아무것이나 올 수 있음.
  • 딕셔너리1: {열의이름:함수}
  • 딕셔너리2: {열의이름:[함수,함수]}
  • 1 np.mean,sum

  • 2 ‘size’, ‘count’, ‘sum’, ‘mean’, ‘median’, ‘min’, ‘max’, ‘std’, ‘var’

  • df.groupby(['category']).aggregate({'value':[np.mean,'size']})
    value
    mean size
    category
    A 0.489715 5
    B 10.228300 5

    pd.cut + df.groupby

    - 구간별 count

    arr = np.random.rand(1000)
    _bin = [0, 0.2, 0.5, 0.9, 1]
    _bin
    [0, 0.2, 0.5, 0.9, 1]

    - [0,0.2,0.5,0.9,1] 구간에 몇개의 숫자들이 있는지 알고 싶고, 구간별 평균도 알고싶다.

    pd.DataFrame({'cat':pd.cut(arr,_bin), 'arr':arr}).groupby('cat').aggregate({'arr':['size',np.mean]})
    arr
    size mean
    cat
    (0.0, 0.2] 189 0.099927
    (0.2, 0.5] 309 0.356922
    (0.5, 0.9] 407 0.702618
    (0.9, 1.0] 95 0.949145

    flights data

    df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')
    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 58492 entries, 0 to 58491
    Data columns (total 14 columns):
     #   Column     Non-Null Count  Dtype  
    ---  ------     --------------  -----  
     0   MONTH      58492 non-null  int64  
     1   DAY        58492 non-null  int64  
     2   WEEKDAY    58492 non-null  int64  
     3   AIRLINE    58492 non-null  object 
     4   ORG_AIR    58492 non-null  object 
     5   DEST_AIR   58492 non-null  object 
     6   SCHED_DEP  58492 non-null  int64  
     7   DEP_DELAY  57659 non-null  float64
     8   AIR_TIME   57474 non-null  float64
     9   DIST       58492 non-null  int64  
     10  SCHED_ARR  58492 non-null  int64  
     11  ARR_DELAY  57474 non-null  float64
     12  DIVERTED   58492 non-null  int64  
     13  CANCELLED  58492 non-null  int64  
    dtypes: float64(3), int64(8), object(3)
    memory usage: 6.2+ MB

    - 예제1: 항공사(AIRLINE)별로 도착지연시간의(ARR_DELAY)의 평균을 구하라.

    df.groupby('AIRLINE').aggregate({'ARR_DELAY':'mean'})
    ARR_DELAY
    AIRLINE
    AA 5.542661
    AS -0.833333
    B6 8.692593
    DL 0.339691
    EV 7.034580
    F9 13.630651
    HA 4.972973
    MQ 6.860591
    NK 18.436070
    OO 7.593463
    UA 7.765755
    US 1.681105
    VX 5.348884
    WN 6.397353

    - 예제2: 항공사(AIRLINE)별로 비행취소건수(CANCELLED)의 합계를 구하라. 취소건수가 가장 높은 두개의 항공사는 어디인가?

    df.groupby('AIRLINE').aggregate({'CANCELLED':'sum'})
    CANCELLED
    AIRLINE
    AA 154
    AS 0
    B6 1
    DL 38
    EV 146
    F9 10
    HA 0
    MQ 152
    NK 25
    OO 142
    UA 93
    US 21
    VX 6
    WN 93

    - 예제3: 항공사(AIRLINE)별로 비행취소율(CANCELLED)을 구하라. 비행취소율이 가장 높은 두개의 항공사는 어디인가?

    df.groupby('AIRLINE').aggregate({'CANCELLED':'mean'})
    # df.groupby('AIRLINE').aggregate({'CANCELLED':lambda x : sum(x)/len(x)})
    CANCELLED
    AIRLINE
    AA 0.017303
    AS 0.000000
    B6 0.001842
    DL 0.003585
    EV 0.024923
    F9 0.007593
    HA 0.000000
    MQ 0.043791
    NK 0.016491
    OO 0.021554
    UA 0.011935
    US 0.013003
    VX 0.006042
    WN 0.011048

    - 예제4: 비행취소율이 가장 높은 두개의 항공사(AIRLINE)를 선택하라. 그 두 항공사에 대하여 요일별(WEEKDAY) 비행취소율(CANCELLED)을 조사하라.

    df.groupby('AIRLINE').aggregate({'CANCELLED':'mean'})
    CANCELLED
    AIRLINE
    AA 0.017303
    AS 0.000000
    B6 0.001842
    DL 0.003585
    EV 0.024923
    F9 0.007593
    HA 0.000000
    MQ 0.043791
    NK 0.016491
    OO 0.021554
    UA 0.011935
    US 0.013003
    VX 0.006042
    WN 0.011048

    MQ와 EV

    - 예제5: 아래는 운행거리의 요약통계량이다.

    df.DIST.describe()
    count    58492.000000
    mean       872.900072
    std        624.996805
    min         67.000000
    25%        391.000000
    50%        690.000000
    75%       1199.000000
    max       4502.000000
    Name: DIST, dtype: float64

    운행거리를 구간별로 [-np.inf,391,690,1199,np.inf]와 같이 나눈뒤 비행취소건수와 취소율을 구하여라.

    _bin = [-np.inf,391,690,1199,np.inf]
    df.assign(DIST2 = pd.cut(df.DIST,_bin)).groupby('DIST2').aggregate({'CANCELLED':['mean','sum']})
    CANCELLED
    mean sum
    DIST2
    (-inf, 391.0] 0.022659 334
    (391.0, 690.0] 0.013503 196
    (690.0, 1199.0] 0.013637 203
    (1199.0, inf] 0.010313 148

    Quiz: HRDataset_v14 자료분석

    아래의 코드를 활용하여 Kaggle의 HRdataset을 불러오라.

    df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/HRDataset_v14.csv')
    df
    Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary ... ManagerName ManagerID RecruitmentSource PerformanceScore EngagementSurvey EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date DaysLateLast30 Absences
    0 Adinolfi, Wilson K 10026 0 0 1 1 5 4 0 62506 ... Michael Albert 22.0 LinkedIn Exceeds 4.60 5 0 1/17/2019 0 1
    1 Ait Sidi, Karthikeyan 10084 1 1 1 5 3 3 0 104437 ... Simon Roup 4.0 Indeed Fully Meets 4.96 3 6 2/24/2016 0 17
    2 Akinkuolie, Sarah 10196 1 1 0 5 5 3 0 64955 ... Kissy Sullivan 20.0 LinkedIn Fully Meets 3.02 3 0 5/15/2012 0 3
    3 Alagbe,Trina 10088 1 1 0 1 5 3 0 64991 ... Elijiah Gray 16.0 Indeed Fully Meets 4.84 5 0 1/3/2019 0 15
    4 Anderson, Carol 10069 0 2 0 5 5 3 0 50825 ... Webster Butler 39.0 Google Search Fully Meets 5.00 4 0 2/1/2016 0 2
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    306 Woodson, Jason 10135 0 0 1 1 5 3 0 65893 ... Kissy Sullivan 20.0 LinkedIn Fully Meets 4.07 4 0 2/28/2019 0 13
    307 Ybarra, Catherine 10301 0 0 0 5 5 1 0 48513 ... Brannon Miller 12.0 Google Search PIP 3.20 2 0 9/2/2015 5 4
    308 Zamora, Jennifer 10010 0 0 0 1 3 4 0 220450 ... Janet King 2.0 Employee Referral Exceeds 4.60 5 6 2/21/2019 0 16
    309 Zhou, Julia 10043 0 0 0 1 3 3 0 89292 ... Simon Roup 4.0 Employee Referral Fully Meets 5.00 3 5 2/1/2019 0 11
    310 Zima, Colleen 10271 0 4 0 1 5 3 0 45046 ... David Stanley 14.0 LinkedIn Fully Meets 4.50 5 0 1/30/2019 0 2

    311 rows × 36 columns

    다음을 읽고 참 거짓을 판단하여라.

    1. 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다.

    2. ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다.

    3. 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다.

    4. 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다.

    아래를 조사하라.

    5. [성별(Sex), 결혼유무(MarriedID)] 별 연봉(Salary)의 평균을 조사하라. 어떠한 그룹이 가장 평균연봉이 적은가?

    6. [성별(Sex), 인종별(RaceDesc)] 별 연봉(Salary)의 중앙값(‘median’)을 조사하라. 연봉의 중앙값이 가장 높은 그룹은 무엇인가?

    7. [성별(Sex), 인종별(RaceDesc)] 별 연봉(Salary)의 중앙값(‘median’)과 근무인원수를 함께 조사하라. 연봉의 중앙값이 가장 높은 그룹이 혜택을 받는 그룹이라고 느껴지는가?

    !git add . 
    !git commit -m .
    !git push
    !quarto publish --no-browser --no-prompt
    [main 7780095] .
     2 files changed, 1186 insertions(+), 6 deletions(-)
    Enumerating objects: 13, done.
    Counting objects: 100% (13/13), done.
    Delta compression using up to 16 threads
    Compressing objects: 100% (7/7), done.
    Writing objects: 100% (7/7), 2.07 KiB | 2.07 MiB/s, done.
    Total 7 (delta 4), reused 0 (delta 0)
    remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
    To https://github.com/guebin/PP2023SUM.git
       a8c13c7..7780095  main -> main
    From https://github.com/guebin/PP2023SUM
     * branch            gh-pages   -> FETCH_HEAD
    Rendering for publish:
    
    [ 1/27] 1_1.qmd
    [ 2/27] about.qmd
    [ 3/27] 1_2.qmd
    [ 4/27] 2_2.qmd
    [ 5/27] posts/Day3/08_PandasBackend/ls21.ipynb
    [ 6/27] posts/Day3/08_PandasBackend/ls20.ipynb
    [ 7/27] posts/Day3/07_ggplot2/ls16.ipynb
    [ 8/27] posts/Day3/07_ggplot2/ls19.ipynb
    [ 9/27] posts/Day3/07_ggplot2/ls18.ipynb
    [10/27] posts/Day3/07_ggplot2/ls17.ipynb
    [11/27] posts/Day2/03_Pandas/ls07.ipynb
    [12/27] posts/Day2/03_Pandas/ls08.ipynb
    [13/27] posts/Day2/03_Pandas/ls09.ipynb
    [14/27] posts/Day2/04_Function, Iteration/ls10.ipynb
    [15/27] posts/Day2/04_Function, Iteration/ls11.ipynb
    [16/27] posts/Day2/05_Modulesls12.ipynb
    [17/27] posts/Day1/01_PythonBasicSyntax/ls02.ipynb
    [18/27] posts/Day1/01_PythonBasicSyntax/ls04.ipynb
    [19/27] posts/Day1/01_PythonBasicSyntax/ls01.ipynb
    [20/27] posts/Day1/01_PythonBasicSyntax/ls03.ipynb
    [21/27] posts/Day1/02_Numpy/ls05.ipynb
    [22/27] posts/Day1/02_Numpy/ls06.ipynb
    [23/27] 3_2.qmd
    [24/27] 3_1.qmd
    [25/27] index.qmd
    [26/27] 2_3.qmd
    [27/27] 2_1.qmd
    
    Preparing worktree (resetting branch 'gh-pages'; was at e9a8acb)
    Branch 'gh-pages' set up to track remote branch 'gh-pages' from 'origin'.
    HEAD is now at e9a8acb Built site for gh-pages
    [gh-pages 6edb56e] Built site for gh-pages
     24 files changed, 1144 insertions(+), 687 deletions(-)
    origin  https://github.com/guebin/PP2023SUM.git (fetch)
    origin  https://github.com/guebin/PP2023SUM.git (push)
    To https://github.com/guebin/PP2023SUM.git
       e9a8acb..6edb56e  HEAD -> gh-pages
    
    NOTE: GitHub Pages sites use caching so you might need to click the refresh
    button within your web browser to see changes after deployment.
    
    [✓] Published to https://guebin.github.io/PP2023SUM/
    
    NOTE: GitHub Pages deployments normally take a few minutes (your site updates
    will be visible once the deploy completes)